Trust Region-Guided Proximal Policy Optimization
Reviews: Trust Region-Guided Proximal Policy Optimization
The paper proposes to adapt the clipping procedure of Proximal Policy Optimization (PPO) so that the lower and upper bounds are no longer constant across all states. The authors show that constant bounds cause convergence to suboptimal policies when the policy is initialized poorly (e.g., when the probability of choosing optimal actions is small). As an alternative, the authors propose to compute state-action-specific lower and upper bounds that lie inside the trust region with respect to the previous policy. If the previous policy assigns a small probability to a given action, the bounds need not be as tight, allowing for less aggressive clipping. The adapted version of PPO, which the authors call TRGPPO, has provably better performance bounds than PPO and is validated empirically in several experiments.
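The fixed-bound clipping the review refers to can be sketched as follows. This is a minimal illustration of the standard PPO clipped surrogate, not code from the paper; eps = 0.2 is the common default rather than a value taken from the work under review:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate objective (per sample).

    The clipping range [1 - eps, 1 + eps] is the same constant for every
    state-action pair. This is the choice the review argues can trap a
    badly initialized policy: a low-probability action can only increase
    its probability by the same fixed factor as any other action.
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # PPO takes the pessimistic (elementwise) minimum of the two terms.
    return np.minimum(unclipped, clipped)
```

With eps = 0.2, a probability ratio of 2.0 with positive advantage is clipped to contribute only 1.2 times the advantage, regardless of how small the action's probability was under the previous policy.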
Trust Region-Guided Proximal Policy Optimization
Wang, Yuhui, He, Hao, Tan, Xiaoyang, Gan, Yaozhong
Proximal policy optimization (PPO) is one of the most popular deep reinforcement learning (RL) methods, achieving state-of-the-art performance across a wide range of challenging tasks. However, as a model-free RL method, the success of PPO relies heavily on the effectiveness of its exploratory policy search. In this paper, we give an in-depth analysis of the exploration behavior of PPO and show that PPO is prone to insufficient exploration, especially under bad initialization, which may lead to the failure of training or entrapment in bad local optima. To address these issues, we propose a novel policy optimization method, named Trust Region-Guided PPO (TRGPPO), which adaptively adjusts the clipping range within the trust region. We formally show that this method not only improves the exploration ability within the trust region but also enjoys a better performance bound compared to the original PPO. Extensive experiments verify the advantage of the proposed method.
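The abstract's key mechanism, a clipping range derived adaptively from a trust region, can be illustrated for a discrete action space. The sketch below is an assumption-laden reconstruction, not the paper's exact formulation: it searches for the largest probability ratio on one action that keeps the KL divergence from the old policy within a budget `delta`, with the remaining probability mass renormalized proportionally (the paper may distribute mass differently). Under this scheme, actions that were rare under the old policy receive a wider upper clipping bound, matching the exploration argument in the abstract:

```python
import numpy as np

def kl_after_ratio_change(p_old, a, ratio):
    """KL(pi_old || pi_new) when action a's probability is scaled by
    `ratio` and the other actions are renormalized proportionally.
    Illustrative assumption about how the new policy is formed."""
    p_new = p_old * (1.0 - p_old[a] * ratio) / (1.0 - p_old[a])
    p_new[a] = p_old[a] * ratio
    return float(np.sum(p_old * np.log(p_old / p_new)))

def adaptive_upper_bound(p_old, a, delta, tol=1e-8):
    """Binary-search the largest ratio u >= 1 whose induced policy stays
    within KL budget `delta` of the old policy; this plays the role of a
    state-action-specific upper clipping bound."""
    lo = 1.0
    hi = (1.0 / p_old[a]) * (1.0 - 1e-9)  # keep all probabilities positive
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kl_after_ratio_change(p_old, a, mid) <= delta:
            lo = mid
        else:
            hi = mid
    return lo
```

For example, with old action probabilities [0.05, 0.45, 0.45, 0.05] and delta = 0.01, the rare first action gets a noticeably larger upper bound than the common second action, i.e. a less aggressive clip exactly where more exploration is needed.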